Abstract

Many learning machines, such as normal mixtures and layered neural networks,
are not regular but singular statistical models, because the map from a
parameter to a probability distribution is not one-to-one. The conventional
statistical asymptotic theory cannot be applied to such learning machines
because the likelihood function cannot be approximated by any normal
distribution. Recently, a new statistical theory has been established based
on algebraic geometry, and it was clarified that the generalization and
training errors are determined by two birational invariants, the real log
canonical threshold and the singular fluctuation. However, their concrete
values have been left unknown. In the present paper, we propose a new concept
in statistical learning theory, the quasi-regular case. A quasi-regular case
is not a regular case but a singular case; nevertheless, it has the same
property as a regular case. In fact, we prove that, in a quasi-regular case,
the two birational invariants are equal to each other, so that the symmetry
of the generalization and training errors holds. Moreover, the concrete
values of the two birational invariants are obtained explicitly, which makes
the quasi-regular case useful for studying statistical learning theory.

1 Introduction 

Many statistical learning machines that are applied to pattern recognition,
bioinformatics, robotic control, and artificial intelligence have hidden
variables, hierarchical layers, and submodules, because they are used to
estimate the structure of the true distributions. In such learning machines,
the map taking parameters to probability distributions is not one-to-one and
the Fisher information matrices are singular; hence they are called singular
learning machines. For example, three-layered neural networks, normal
mixtures, hidden Markov models, Bayesian networks, and reduced rank
regressions are singular learning machines. If a statistical model is
singular, then either the maximum likelihood estimator is not subject to the
normal distribution even asymptotically, or the Bayes posterior distribution
cannot be approximated by any normal distribution. Hence it has been
difficult to study their learning performance and to estimate the
generalization error from the training error.

Recently, a new statistical theory has been established based on algebraic
geometrical methods, and it was clarified that the generalization and
training errors in Bayes estimation, $G_n$ and $T_n$, are given by two
birational invariants, the real log canonical threshold $\lambda$ and the
singular fluctuation $\nu$, by the formulas

\[ E[G_n] = \Big( \frac{\lambda - \nu}{\beta} + \nu \Big) \frac{1}{n} + o\Big(\frac{1}{n}\Big), \tag{1} \]

\[ E[T_n] = \Big( \frac{\lambda - \nu}{\beta} - \nu \Big) \frac{1}{n} + o\Big(\frac{1}{n}\Big), \tag{2} \]

where $E[\ ]$ shows the expectation value over all training sets, $n$ is the
number of training samples, and $\beta$ is the inverse temperature of the
Bayes posterior distribution. Based on this relation, we can define an
information criterion which enables us to estimate the generalization error
from the training error [13].

It is well known that, if the true distribution and the statistical model are
in a regular case, then $\lambda = \nu = d/2$ holds, where $d$ is the
dimension of the parameter space. In this case, the symmetry of the
generalization and training errors holds,

\[ E[G_n] = -E[T_n] + o\Big(\frac{1}{n}\Big) = \frac{d}{2n} + o\Big(\frac{1}{n}\Big), \tag{3} \]

for arbitrary $0 < \beta < \infty$. This case corresponds to the well-known
Akaike information criterion for regular statistical models. However, if they
are not in a regular case, neither of the invariants is equal to $d/2$ in
general. Therefore, in order to study singular learning machines, research on
the two birational invariants is necessary.

In the present paper, in order to investigate the mathematical structure of
the birational invariants, we first introduce a new concept, the
quasi-regular case, which satisfies the relation

\[ \text{Regular} \subset \text{Quasi-Regular} \subset \text{Singular}. \]

In other words, a quasi-regular case is not a regular case; however, it has
the same properties as the regular case. In fact, we prove that, in
quasi-regular cases, both birational invariants are equal to each other,
$\lambda = \nu$, and the symmetry of the generalization and training errors
holds. In a quasi-regular case, the two birational invariants are obtained
explicitly; hence it is a useful concept for research in statistical learning
theory.

2 Framework of Bayes Learning

In this section, we summarize the framework of Bayes learning and introduce
the well-known results.

2.1 Generalization and Training Errors 

Firstly, we define the generalization and training errors. Let $N$, $n$ and
$d$ be natural numbers. Let $X_1, X_2, \ldots, X_n$ be random variables on
$\mathbb{R}^N$ which are independently subject to the same probability
density function $q(x)$. Let $p(x|w)$ be a probability density function of
$x$ for a parameter $w \in W \subset \mathbb{R}^d$, where $W$ is a set of
parameters. The prior distribution is represented by the probability density
function $\varphi(w)$ on $W$. For a given training set

\[ X^n = \{X_1, X_2, \ldots, X_n\}, \]

the posterior distribution is defined by

\[ p(w|X^n) = \frac{1}{Z_n}\, \varphi(w) \prod_{i=1}^{n} p(X_i|w)^{\beta}, \]

where $0 < \beta < \infty$ is the inverse temperature and $Z_n$ is the
normalizing constant. The case $\beta = 1$ is most important because it
corresponds to the strict Bayes estimation. The expectation value over the
posterior distribution is denoted by

\[ E_w[\ ] = \int (\ )\, p(w|X^n)\, dw. \]

The predictive distribution is defined by

\[ p(x|X^n) = E_w[p(x|w)]. \]
The generalization and training errors, $G_n$ and $T_n$, are respectively
defined by

\[ G_n = \int q(x) \log \frac{q(x)}{p(x|X^n)}\, dx, \]

\[ T_n = \frac{1}{n} \sum_{i=1}^{n} \log \frac{q(X_i)}{p(X_i|X^n)}. \]

The generalization error is the Kullback-Leibler distance from the true
distribution to the estimated distribution. The smaller the generalization
error is, the better the learning result is. However, we cannot know the
generalization error directly, because the calculation of $G_n$ needs the
expectation value over the unknown true distribution $q(x)$. On the other
hand, the training error can be calculated using only training samples, in
practice through the log likelihood function. Hence one of the main purposes
of statistical learning theory is to clarify the mathematical relation
between them.
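The definitions of this subsection can be made concrete on a toy problem. The following is a minimal sketch, not taken from the paper: it assumes the one-dimensional model p(x|w) = N(x; w, 1), the true distribution q(x) = N(x; 0, 1), and a flat prior on [-5, 5], and it approximates the integrals over w and x by Riemann sums on grids. The function names and grid sizes are illustrative choices.

```python
import numpy as np

def normal_pdf(x, mean):
    """Density of N(mean, 1) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2.0 * np.pi)

def bayes_errors(samples, beta=1.0):
    """Return (G_n, T_n) for the assumed toy model p(x|w) = N(x; w, 1), q = N(0, 1)."""
    w = np.linspace(-5.0, 5.0, 1001)      # parameter grid (flat prior on [-5, 5])
    x = np.linspace(-10.0, 10.0, 2001)    # integration grid for x
    dw, dx = w[1] - w[0], x[1] - x[0]

    # Unnormalized log posterior: beta * sum_i log p(X_i|w), constants dropped.
    log_post = beta * np.sum(-0.5 * (samples[:, None] - w[None, :]) ** 2, axis=0)
    post = np.exp(log_post - log_post.max())
    post /= post.sum() * dw               # division by Z_n (Riemann sum)

    def predictive(points):
        # p(x|X^n) = E_w[p(x|w)], approximated on the w grid.
        return (normal_pdf(points[:, None], w[None, :]) * post[None, :]).sum(axis=1) * dw

    q_grid = normal_pdf(x, 0.0)
    g_n = np.sum(q_grid * np.log(q_grid / predictive(x))) * dx
    t_n = np.mean(np.log(normal_pdf(samples, 0.0) / predictive(samples)))
    return g_n, t_n

rng = np.random.default_rng(0)
samples = rng.normal(size=50)             # X_1, ..., X_n drawn from q
g_n, t_n = bayes_errors(samples, beta=1.0)
print(f"G_n = {g_n:.5f}, T_n = {t_n:.5f}")
```

Note that computing G_n here is possible only because q(x) is known in the simulation; with real data only the likelihood part of T_n is available.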

2.2 Two Birational Invariants

Secondly, we define two birational invariants.

The Kullback-Leibler distance from the true distribution $q(x)$ to a
parametric model $p(x|w)$ is defined by

\[ K(w) = \int q(x) \log \frac{q(x)}{p(x|w)}\, dx. \]

Then $K(w) = 0$ if and only if $q(x) = p(x|w)$. In this paper, we assume that
there exists a parameter $w_0$ which satisfies $q(x) = p(x|w_0)$ and that
$K(w)$ is an analytic function of $w$.

Definition 1 (Real Log Canonical Threshold) The zeta function of statistical
learning is defined by

\[ \zeta(z) = \int K(w)^z \varphi(w)\, dw. \]

Then $\zeta(z)$ is a holomorphic function on the region $\mathrm{Re}(z) > 0$,
which can be analytically continued to the unique meromorphic function on the
entire complex plane. All poles of the zeta function are real, negative, and
rational numbers. If its largest pole is $(-\lambda)$, then the real log
canonical threshold is defined by $\lambda$. The order of the pole
$z = -\lambda$ is referred to as the multiplicity $m$.
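As an illustration of Definition 1, the zeta function can be computed in closed form in two simple situations. This worked example is added under simplifying assumptions that are not taken from the paper: a uniform prior on a cube, and the exact forms K(w) = w^2 and K(w) = (w_1 w_2)^2.

```latex
% Case 1: d = 1, K(w) = w^2, \varphi(w) = 1/2 on [-1, 1].
\[
  \zeta(z) = \frac{1}{2}\int_{-1}^{1} (w^2)^{z}\, dw
           = \int_{0}^{1} w^{2z}\, dw
           = \frac{1}{2z+1}
\]
% Largest pole z = -1/2 of order 1, hence \lambda = 1/2 and m = 1.

% Case 2: d = 2, K(w) = (w_1 w_2)^2, \varphi(w) = 1 on [0, 1]^2.
\[
  \zeta(z) = \int_{0}^{1}\!\!\int_{0}^{1} (w_1 w_2)^{2z}\, dw_1\, dw_2
           = \frac{1}{(2z+1)^2}
\]
% Largest pole z = -1/2 of order 2, hence \lambda = 1/2 and m = 2.
```

The first case is regular and gives $\lambda = d/2$. In the second case the set $K(w) = 0$ is the union of the two coordinate axes, so $\lambda$ is smaller than $d/2 = 1$ and the pole has order two.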

Definition 2 (Singular Fluctuation) The functional variance is defined by

\[ V_n = \sum_{i=1}^{n} \Big\{ E_w\big[(\log p(X_i|w))^2\big] - E_w\big[\log p(X_i|w)\big]^2 \Big\}. \]

Then it was proved that the limit

\[ \nu = \frac{\beta}{2} \lim_{n \to \infty} E[V_n] \]

exists. The constant $\nu$ is called the singular fluctuation.
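In practice the functional variance is computed from draws of the posterior distribution. The sketch below is an illustration under stated assumptions rather than a procedure from the paper: `log_lik[k, i]` is assumed to hold log p(X_i | w_k) for posterior draws w_k, and the toy check uses the regular model p(x|w) = N(x; w, 1) with a flat prior, whose posterior N(mean(X), 1/(n beta)) can be sampled exactly. The function name and the sample sizes are hypothetical.

```python
import numpy as np

def functional_variance(log_lik):
    """V_n from log_lik[k, i] = log p(X_i | w_k), with w_k drawn from the posterior."""
    # Posterior variance of log p(X_i|w) for each i, summed over the training set.
    return float(np.sum(np.var(log_lik, axis=0)))

# Toy check on an assumed regular model with d = 1.
rng = np.random.default_rng(1)
n, draws, beta = 1000, 2000, 1.0
x = rng.normal(size=n)                                   # training samples from q = N(0, 1)
w = rng.normal(loc=x.mean(), scale=1.0 / np.sqrt(n * beta), size=draws)
log_lik = -0.5 * (x[None, :] - w[:, None]) ** 2 - 0.5 * np.log(2.0 * np.pi)

# Finite-sample estimate of nu = (beta / 2) * lim E[V_n] from a single training set.
nu_hat = 0.5 * beta * functional_variance(log_lik)
print(f"estimated singular fluctuation: {nu_hat:.3f}")
```

For this regular one-parameter model the estimate should be close to d/2 = 1/2, up to sampling noise from both the training set and the posterior draws.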

Theorem 1 The expectation values of the generalization and training errors
are given by eq.(1) and eq.(2). Therefore

\[ E[G_n] = E[T_n] + \frac{2\nu}{n} + o\Big(\frac{1}{n}\Big). \]

(Proof) This theorem was proved in QUE]. (Q-E.D.) 

Remarks. (1) The real log canonical threshold and the singular fluctuation
are invariant under a birational transform

\[ w = g(w'), \qquad p(x|w) \mapsto p(x|g(w')), \qquad \varphi(w) \mapsto \varphi(g(w'))\,|g'(w')|, \]

where $|g'(w')|$ is the Jacobian determinant. Such constants are called
birational invariants.

(2) The real log canonical thresholds for several learning machines were
clarified using resolution of singularities. However, the singular
fluctuation has been left unknown. This paper provides the first result which
clarifies the concrete values of the singular fluctuation in a singular case.

(3) The real log canonical threshold is a well-known birational invariant in
algebraic geometry, which plays an important role in higher dimensional
algebraic geometry. The singular fluctuation was found in statistical
learning theory.

2.3 Regular and Singular 

Thirdly, we define regular and singular cases.

Definition 3 A pair of the true distribution and the parametric model,
$(q(x), p(x|w))$, is said to be in a regular case if and only if the set
$\{w ; q(x) = p(x|w)\}$ consists of a single element $w_0$ and the Fisher
information matrix $I(w_0)$, where

\[ I_{ij}(w) = \int \frac{\partial \log p(x|w)}{\partial w_i}\, \frac{\partial \log p(x|w)}{\partial w_j}\, p(x|w)\, dx, \]

is positive definite. Otherwise, it is said to be in a singular case.

For a regular case, the real log canonical threshold and the singular
fluctuation have been completely clarified.

Theorem 2 If a pair $(q(x), p(x|w))$ is in a regular case, then
$\lambda = \nu = d/2$, where $d$ is the dimension of the parameter.

(Proof) This theorem was proved in [15]. (Q.E.D.)



3 Main Results

In this section, we define a quasi-regular case. This concept is proposed for
the first time in the present paper. The main theorem is also introduced.

Definition 4 (Quasi-Regular Case). Assume that there exists a parameter
$w_0 \in W^{\circ}$ such that $q(x) = p(x|w_0)$. Without loss of generality,
we can assume that $w_0$ is the origin, $w_0 = 0$. The original parameter is
denoted by $w = (w_1, w_2, \ldots, w_d)$. Let $g$ and
$\Delta d_1, \Delta d_2, \ldots, \Delta d_g$ be natural numbers which satisfy

\[ \Delta d_1 + \Delta d_2 + \cdots + \Delta d_g = d \]

and let $\Delta d_0 = 0$. We define

\[ d_j = \Delta d_0 + \cdots + \Delta d_j \qquad (j = 0, \ldots, g) \]

and a function $u = (u_1, u_2, \ldots, u_g) \in \mathbb{R}^g$ of the parameter
$w \in \mathbb{R}^d$ by

\[ u_1 = \prod_{j=d_0+1}^{d_1} w_j, \quad \ldots, \quad u_g = \prod_{j=d_{g-1}+1}^{d_g} w_j. \]

If there exist constants $c_1, c_2 > 0$ such that, for arbitrary $w \in W$,

\[ c_1 (u_1^2 + \cdots + u_g^2) \le K(w) \le c_2 (u_1^2 + \cdots + u_g^2), \]

then the pair $(q(x), p(x|w))$ is said to be in a quasi-regular case.

Remark. (1) If $g = d$, then

\[ \{w ; q(x) = p(x|w)\} = \{0\} \]

and the quasi-regular case corresponds to the regular case. Hence a
quasi-regular case contains a regular case as a special one.

(2) If $d \neq g$, then "$K(w) = 0 \iff w = 0$" does not hold, because, for
at least one variable $w_j$, $K(0, \ldots, 0, w_j, 0, \ldots, 0) = 0$. Hence a
quasi-regular case with $d \neq g$ is not a regular case but a singular case.

(3) There are singular cases which are not contained in quasi-regular cases.
Therefore,

\[ \text{Regular} \subset \text{Quasi-Regular} \subset \text{Singular} \]

holds. The present paper shows in Theorem 3 that a quasi-regular case is not
a regular case; however, it has the same property as a regular case.

Example. 1 Let a statistical model be

\[ p(x, y|w) = \frac{r(x)}{\sqrt{2\pi}} \exp\Big( -\frac{1}{2} \big(y - a x^2 - b \tanh(cx)\big)^2 \Big), \]

where $w = (a, b, c)$ is the parameter and $r(x)$ is the probability density
function of $x$. If the true distribution is given by
$q(x, y) = p(x, y|0, 0, 0)$, then by using

\[ u_1 = a, \qquad u_2 = bc, \]

it follows that

\[ K(w) = \frac{1}{2} \int \big(a x^2 + b \tanh(cx)\big)^2 r(x)\, dx \]

satisfies the condition for a quasi-regular case with $g = 2$, because $x^2$
and $\tanh(cx)/c$ are linearly independent. In fact there exist
$c_1, c_2 > 0$ such that

\[ c_1 \big(a^2 + (bc)^2\big) \le K(w) \le c_2 \big(a^2 + (bc)^2\big). \]

Hence the set of true parameters consists of the union of two lines,

\[ \{w ; q(x, y) = p(x, y|w)\} = \{a = 0,\ bc = 0\}. \]
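The two-sided bound of Example 1 can be probed numerically. The sketch below is only an illustration: it assumes that r(x) is the standard normal density and that the parameter set is the box [-1, 1]^3, neither of which is specified in the paper, and it evaluates K(w) by a Riemann sum on a grid. It then reports the smallest and largest observed values of the ratio K(w) / (a^2 + (bc)^2) over randomly drawn parameters.

```python
import numpy as np

# Integration grid and the assumed input density r(x) = N(0, 1).
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
r = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def kl_example1(a, b, c):
    """K(w) = (1/2) * integral of (a x^2 + b tanh(c x))^2 r(x) dx, by a Riemann sum."""
    return 0.5 * np.sum((a * x ** 2 + b * np.tanh(c * x)) ** 2 * r) * dx

rng = np.random.default_rng(0)
params = rng.uniform(-1.0, 1.0, size=(2000, 3))    # assumed parameter box [-1, 1]^3

# Ratio of K(w) to u_1^2 + u_2^2 with u_1 = a and u_2 = b * c.
ratios = np.array([kl_example1(a, b, c) / (a ** 2 + (b * c) ** 2) for a, b, c in params])
print(f"min ratio = {ratios.min():.3f}, max ratio = {ratios.max():.3f}")
```

Under these assumptions the ratio stays between two positive constants, which play the roles of $c_1$ and $c_2$ on the sampled region.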

Example. 2 Let a statistical model be

\[ p(x, y|w) = \frac{r(x)}{\sqrt{2\pi}} \exp\Big( -\frac{1}{2} \big(y - a x - b \tanh(cx)\big)^2 \Big), \]

where $w = (a, b, c)$ is the parameter and the true distribution is given by
$q(x, y) = p(x, y|0, 0, 0)$. Then, because $x$ and $\tanh(cx)/c$ are not
linearly independent as $c \to 0$, this case does not satisfy the
quasi-regular condition. In this case

\[ c_1 \big((a + bc)^2 + b^2 c^6\big) \le K(w) \le c_2 \big((a + bc)^2 + b^2 c^6\big). \]

Example 2 resembles Example 1; however, from the viewpoint of statistical
learning theory, they are different.

Example. 3 Let a statistical model be

\[ p(x, y, z|w) = \frac{r(x, y)}{\sqrt{2\pi}} \exp\Big( -\frac{1}{2} \big(z - f(x, y, w)\big)^2 \Big), \]

where $r(x, y)$ is the probability density function of $(x, y)$,

\[ f(x, y, w) = a_1 \sin(b_1 x) + a_2 x \sin(b_2 x) + a_3 \sin(b_3 y) + a_4 y \sin(b_4 y), \]

$w = \{(a_i, b_i)\}$ is the parameter, and the true distribution is given by
$q(x, y, z) = p(x, y, z|0)$. Then $(q(x, y, z), p(x, y, z|w))$ is in a
quasi-regular case with $g = 4$.

The following is the main theorem of the present paper.

Theorem 3 (Main Theorem). Assume that the pair $(q(x), p(x|w))$ is in a
quasi-regular case and that $\varphi(w) > 0$ on $W$. Then the real log
canonical threshold and the singular fluctuation are given by

\[ \lambda = \nu = \frac{g}{2}, \]

and the multiplicity is given by

\[ m = d - g + 1. \]

Corollary 1 Assume that the pair $(q(x), p(x|w))$ is in a quasi-regular case
and that $\varphi(w) > 0$ on $W$. For arbitrary $0 < \beta < \infty$ the
symmetry of the generalization and training errors holds,

\[ E[G_n] = \frac{g}{2n} + o\Big(\frac{1}{n}\Big), \]

\[ E[T_n] = -\frac{g}{2n} + o\Big(\frac{1}{n}\Big). \]

Remarks. (1) The above theorem shows the generalization and training errors
for Bayes estimation. In the quasi-regular case, they have the same property
as those in regular cases; however, the generalization and training errors of
the maximum likelihood estimation are different from those of the regular
case in general.
(2) In the maximum likelihood method, the training error of a singular case
is far smaller than that of a regular case, whereas the generalization error
of a singular case is far larger than that of a regular case. From the
viewpoint of the maximum likelihood method, the quasi-regular case is
contained in the singular case. In the present paper, we prove that the
quasi-regular case has the same property as the regular case from the
viewpoint of Bayes estimation.

4 Proofs 

In this section, we prove the main theorem. At first, we derive the real log
canonical threshold of the quasi-regular case.

Lemma 1 The real log canonical threshold and its order are given by
$\lambda = g/2$ and $m = d - g + 1$, respectively.

(Proof) Since the functions $\{u_j ; j = 1, 2, \ldots, g\}$ do not share any
common variable $w_k$, the real log canonical threshold is given by the sum
of the individual real log canonical thresholds (Remark 7.2 in [15]), which
are determined by the largest poles of

\[ \int u_j^{2z} \prod_{k=d_{j-1}+1}^{d_j} dw_k = \frac{c}{(z + 1/2)^{d_j - d_{j-1}}} + \cdots . \]

Hence $\lambda$ is equal to $g$ times $1/2$, that is, $\lambda = g/2$. The
multiplicity is also given by

\[ m = d_1 + (d_2 - d_1) + \cdots + (d_g - d_{g-1}) - (g - 1) = d - g + 1, \]

which shows the Lemma. (Q.E.D.)

Definition 5 For a given pair of the true distribution $q(x)$ and the
parametric model $p(x|w)$, the log density ratio function is defined by

\[ f(x, w) = \log \frac{q(x)}{p(x|w)}. \]

The following lemma shows that the log density ratio function of the
quasi-regular case is represented by $g$ linearly independent functions.

Lemma 2 Assume that the pair $(q(x), p(x|w))$ is in a quasi-regular case.
Then there exists a set of functions $\{e_j(x, u) ; j = 1, 2, \ldots, g\}$
which are analytic functions of $u$ and satisfy

\[ f(x, w) = \sum_{j=1}^{g} u_j\, e_j(x, u) \]

in an open neighborhood of $u = 0$.
(Proof) Let us define a function

\[ F(t) = t + e^{-t} - 1 \]

for $t \in \mathbb{R}^1$. Then $F(0) = 0$, $F'(0) = 0$, and $F''(t) = e^{-t} > 0$,
so that $F(t) \ge 0$ and $F(t) = 0$ if and only if $t = 0$. Moreover,
$F(t) \cong (1/2) t^2$ for small $|t|$. Therefore,

\[ K(w) = \int q(x)\, F\Big( \log \frac{q(x)}{p(x|w)} \Big)\, dx
        = \int q(x)\, F(f(x, w))\, dx
        \cong \frac{1}{2} \int q(x)\, f(x, w)^2\, dx. \tag{5} \]

By the assumption of the quasi-regular case, $K(w) = 0$ if and only if
$u_1 = u_2 = \cdots = u_g = 0$, which is equivalent to $f(x, w) = 0$. That is
to say, $f(x, w)$ is contained in the ideal of analytic functions generated
by $u_1, u_2, \ldots, u_g$. Hence there exists a set $\{e_j(x, u)\}$ of
analytic functions of $u$ which satisfies

\[ f(x, w) = \sum_{j=1}^{g} u_j\, e_j(x, u). \]

Therefore, we obtain the Lemma. (Q.E.D.)

In the following lemma, we show that the quasi-regular case has a generalized
Fisher information matrix.

Lemma 3 The $g \times g$ matrix $I(u)$ is defined by

\[ I_{ij}(u) = \int q(x)\, e_i(x, u)\, e_j(x, u)\, dx. \]

Then $I(u)$ is positive definite in an open neighborhood of $u = 0$.

(Proof) By Lemma 2 and eq.(5), in the neighborhood of $u = 0$,

\[ K(w) \cong \frac{1}{2} \big(u \cdot I(u) u\big). \]

By the condition of the quasi-regular case,

\[ K(w) \ge c_1 \sum_{j=1}^{g} u_j^2. \]

Hence the minimum eigenvalue of $I(u)$ is positive, which shows that $I(u)$
is positive definite. (Q.E.D.)

The following definition and lemma show that the empirical loss function of
the quasi-regular case has the same decomposition as that of the regular
case.

Definition 6 A random process $\xi_n(u) \in \mathbb{R}^g$ is defined by

\[ \xi_n(u) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Big\{ \frac{1}{2} I(u) u - e(X_i, u) \Big\}, \]

where

\[ e(x, u) = \big(e_1(x, u), e_2(x, u), \ldots, e_g(x, u)\big)^T. \]

Lemma 4 The empirical loss function defined by

\[ K_n(w) = \frac{1}{n} \sum_{i=1}^{n} f(X_i, w) \]

is represented by

\[ K_n(w) = \frac{1}{2} \big(u \cdot I(u) u\big) - \frac{1}{\sqrt{n}}\, u \cdot \xi_n(u) \]

in the neighborhood of $u = 0$. Moreover, the random process $\xi_n(u)$
converges to the Gaussian process $\xi(u)$ that satisfies

\[ E\big[\xi(0) \cdot I(0)^{-1} \xi(0)\big] = g. \]

(Proof) The empirical loss function is given by

\[ K_n(w) = K(w) - \frac{1}{n} \sum_{i=1}^{n} \{K(w) - f(X_i, w)\}. \]

By combining this equation with the definition of $\xi_n(u)$, the first half
of the Lemma is obtained. For the second half, the convergence of $\xi_n(u)$
is derived from the general empirical process theory. Moreover,

\[ E\big[\xi(0) \cdot I(0)^{-1} \xi(0)\big] = E\big[\mathrm{tr}\big(I(0)^{-1} \xi(0) \xi(0)^T\big)\big] = g, \]

where we used the covariance matrix of $\xi_n(0)$,

\[ E\big[\xi_n(0) \xi_n(0)^T\big] = \int q(x)\, e(x, 0)\, e(x, 0)^T dx = I(0), \]

which completes the Lemma. (Q.E.D.)

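In the quasi-regular case, the relation between $w = (w_1, w_2, \ldots, w_d)$
and $u = (u_1, u_2, \ldots, u_g)$ is important. The following lemma shows a
property of the quasi-regular case. This lemma does not hold in general
singular cases.

Lemma 5 When $n$ tends to infinity,

\[ \prod_{j=1}^{g} \delta\Big( \frac{u_j}{\sqrt{n}} - \prod_{k=d_{j-1}+1}^{d_j} w_k \Big)
   \cong c_3\, (\log n)^{m-1} \prod_{j=1}^{d} \delta(w_j), \]

where $m = d - g + 1$ and $c_3 > 0$ is a constant.

(Proof) Firstly, we prove that the delta function with variables
$x = (x_1, x_2, \ldots, x_d)$ in $M = [0, 1]^d$,

\[ D(t, x) = \delta(t - x_1 x_2 \cdots x_d), \]

has the asymptotic expansion for $t \to +0$,

\[ D(t, x) = \frac{(-\log t)^{d-1}}{(d-1)!} \prod_{k=1}^{d} \delta(x_k) + O\big((-\log t)^{d-2}\big). \tag{6} \]

Let $\phi(x)$ be an arbitrary $C^{\infty}$-class function of $x$ whose support
is contained in $M$, and let

\[ D_t(\phi) = \int D(t, x)\, \phi(x)\, dx. \]

Then its Mellin transform is

\[ \int D_t(\phi)\, t^z\, dt = \int_M \prod_{i=1}^{d} (x_i)^z\, \phi(x)\, dx, \]

where the compact set $M$ contains the support of $\phi$. By using the Taylor
expansion

\[ \phi(x) = \phi(0) + x \cdot \nabla \phi(0) + \cdots, \]

we have the asymptotic expansion

\[ \int D_t(\phi)\, t^z\, dt = \frac{\phi(0)}{(z+1)^d} + \cdots, \]

where the remaining terms have poles at $z = -1$ of order smaller than $d$.
Therefore, by using the inverse Mellin transform,

\[ D_t(\phi) = \frac{(-\log t)^{d-1}}{(d-1)!}\, \phi(0) + O\big((-\log t)^{d-2}\big), \]

and we obtain eq.(6). Secondly, let us prove the Lemma. By using eq.(6) with
$t = u_j/\sqrt{n}$, for each $u_j$ there exists a constant $c^{(j)} > 0$ such
that

\[ \delta\Big( \frac{u_j}{\sqrt{n}} - \prod_{k=d_{j-1}+1}^{d_j} w_k \Big)
   \cong c^{(j)} (\log n)^{d_j - d_{j-1} - 1} \prod_{k=d_{j-1}+1}^{d_j} \delta(w_k) \]

when $n \to \infty$. By combining these relations for $j = 1, 2, \ldots, g$,
and since the exponents add up to $d - g = m - 1$, the Lemma is obtained.
(Q.E.D.)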

Let us return to the proof of the Main Theorem.

(Proof of Main Theorem) It was proved by eq.(6.4) in [15] that the
expectation value of $K_n(w)$ is given by the two birational invariants,

\[ E\big[E_w[K_n(w)]\big] = \frac{\lambda}{n\beta} - \frac{\nu}{n} + o\Big(\frac{1}{n}\Big). \]

Since we have already obtained the value of $\lambda$ in Lemma 1, that is to
say, $\lambda = g/2$, we can derive the value of $\nu$ by calculating
$E[E_w[K_n(w)]]$. The posterior distribution is represented by the empirical
loss function as

\[ p(w|X^n)\, dw \propto \exp(-n\beta K_n(w))\, \varphi(w)\, dw. \]

The integral over the outside of the neighborhood of $u = 0$ with respect to
the posterior distribution goes to zero with an order smaller than
$\exp(-\sqrt{n})$, as in Lemma 6.3 in [15]; hence we can restrict the region
of integration to the neighborhood of $u = 0$. The empirical loss function is
rewritten as

\[ K_n(w) = \frac{1}{2} \Big\| I(u)^{1/2} \Big( u - \frac{I(u)^{-1} \xi_n(u)}{\sqrt{n}} \Big) \Big\|^2
          - \frac{1}{2n}\, \xi_n(u) \cdot I(u)^{-1} \xi_n(u). \]

In the neighborhood of $u = 0$, we obtain

\[ K_n(w) \cong \frac{1}{2} \Big\| I(0)^{1/2} \Big( u - \frac{I(0)^{-1} \xi_n(0)}{\sqrt{n}} \Big) \Big\|^2
          - \frac{1}{2n}\, \xi_n(0) \cdot I(0)^{-1} \xi_n(0). \]
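
For an arbitrary function $F(\ )$,

\[ \int F(\sqrt{n}\, u)\, dw
   = \int \prod_{j=1}^{g} \delta\Big( u_j - \prod_{k=d_{j-1}+1}^{d_j} w_k \Big) F(\sqrt{n}\, u)\, dw\, du \]

\[ = \frac{1}{n^{g/2}} \int \prod_{j=1}^{g} \delta\Big( \frac{u_j}{\sqrt{n}} - \prod_{k=d_{j-1}+1}^{d_j} w_k \Big) F(u)\, dw\, du. \]

On the other hand,

\[ n K_n(w) \cong \frac{1}{2} \big\| I(0)^{1/2} \big( \sqrt{n}\, u - I(0)^{-1} \xi_n(0) \big) \big\|^2
   - \frac{1}{2} \big( \xi_n(0) \cdot I(0)^{-1} \xi_n(0) \big) \equiv \hat{K}_n(\sqrt{n}\, u). \]

Therefore,

\[ E_w[K_n(w)] = \frac{\int K_n(w) \exp(-n\beta K_n(w))\, \varphi(w)\, dw}{\int \exp(-n\beta K_n(w))\, \varphi(w)\, dw} \]

\[ \cong \frac{1}{n}\, \frac{\int \hat{K}_n(\sqrt{n}\, u) \exp(-\beta \hat{K}_n(\sqrt{n}\, u))\, \varphi(w)\, dw}{\int \exp(-\beta \hat{K}_n(\sqrt{n}\, u))\, \varphi(w)\, dw} \]

\[ \cong \frac{1}{n}\, \frac{\int \hat{K}_n(u) \exp(-\beta \hat{K}_n(u))\, du}{\int \exp(-\beta \hat{K}_n(u))\, du} \]

\[ = \frac{1}{2n}\, \frac{\int \| I(0)^{1/2} (u - \zeta) \|^2 \exp(-\beta \hat{K}_n(u))\, du}{\int \exp(-\beta \hat{K}_n(u))\, du}
   - \frac{1}{2n} \big( \xi_n(0) \cdot I(0)^{-1} \xi_n(0) \big), \]

where the notation

\[ \zeta = I(0)^{-1} \xi_n(0) \]

is used. Finally, by the integral formula

\[ \frac{\int \| I(0)^{1/2} u \|^2 \exp\big(-\frac{\beta}{2} \| I(0)^{1/2} u \|^2\big)\, du}{\int \exp\big(-\frac{\beta}{2} \| I(0)^{1/2} u \|^2\big)\, du} = \frac{g}{\beta} \]

and by Lemma 4, we have

\[ E\big[E_w[K_n(w)]\big] = \frac{g}{2n\beta} - \frac{g}{2n} + o\Big(\frac{1}{n}\Big). \]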

Then, because $\lambda = g/2$ holds from Lemma 1, we obtain the Theorem.
(Q.E.D.)

            regular      quasi-regular    singular
  RLCT      d/2          g/2              $\lambda$
  SF        d/2          g/2              $\nu$
  $G_n$     d/(2n)       g/(2n)           $((\lambda - \nu)/\beta + \nu)/n$
  $T_n$     -d/(2n)      -g/(2n)          $((\lambda - \nu)/\beta - \nu)/n$

Table 1: Regular, Quasi-Regular, and Singular
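The Gaussian integral formula used in the proof of the Main Theorem can be checked numerically. The following is a minimal Monte Carlo sketch and not part of the original proof; the dimension g = 3, the inverse temperature beta = 2, and the randomly generated positive definite matrix standing in for I(0) are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
g, beta, n_draws = 3, 2.0, 200_000

# An arbitrary symmetric positive definite matrix playing the role of I(0).
a = rng.normal(size=(g, g))
info = a @ a.T + g * np.eye(g)

# The weight exp(-(beta/2) u^T I u) is proportional to a Gaussian density
# with covariance (beta * I)^{-1}, so the ratio of integrals is an expectation.
chol = np.linalg.cholesky(info)                  # info = chol @ chol.T
z = rng.normal(size=(g, n_draws))
u = np.linalg.solve(chol.T, z) / np.sqrt(beta)   # columns are draws of u

quad = np.einsum("in,ij,jn->n", u, info, u)      # ||I^{1/2} u||^2 for each draw
estimate = quad.mean()
print(f"Monte Carlo: {estimate:.4f}, formula g/beta: {g / beta:.4f}")
```

The agreement does not depend on the particular matrix, since the quadratic form has the distribution of a chi-square variable with g degrees of freedom divided by beta.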

Example. 4 By the main theorem of this paper, the real log canonical
threshold and the singular fluctuation of Example 1 are $\lambda = \nu = 1$.
Also those of Example 3 are $\lambda = \nu = 2$.
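The values in Example 4 follow mechanically from the block structure of the parameter. The helper below is a small sketch with a hypothetical function name; it takes the block sizes $\Delta d_1, \ldots, \Delta d_g$ and returns the invariants stated in the Main Theorem. For Example 3 it assumes that each block is a pair (a_i, b_i), which is consistent with d = 8 and g = 4 but is not spelled out in the text.

```python
from fractions import Fraction

def quasi_regular_invariants(block_sizes):
    """Invariants of a quasi-regular case from the block sizes (Delta d_1, ..., Delta d_g)."""
    if not block_sizes or any(size < 1 for size in block_sizes):
        raise ValueError("block sizes must be a non-empty list of natural numbers")
    d = sum(block_sizes)          # dimension of the parameter w
    g = len(block_sizes)          # number of product coordinates u_1, ..., u_g
    return {
        "d": d,
        "g": g,
        "lambda": Fraction(g, 2),  # real log canonical threshold
        "nu": Fraction(g, 2),      # singular fluctuation
        "m": d - g + 1,            # multiplicity
    }

example1 = quasi_regular_invariants([1, 2])        # u_1 = a, u_2 = b * c
example3 = quasi_regular_invariants([2, 2, 2, 2])  # assumed blocks (a_i, b_i)
print(example1)
print(example3)
```

With all block sizes equal to one the function returns $\lambda = \nu = d/2$ and $m = 1$, the regular case.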

5 Discussion

Let us discuss the results of this paper from two different points of view:
first the theoretical aspect, and then the practical aspect.

5.1 Theoretical point of view

In the present paper, we introduced a new concept, the quasi-regular case. A
quasi-regular case is not a regular case, but it has the same property as the
regular case. Table 1 compares the real log canonical threshold (RLCT), the
singular fluctuation (SF), the generalization error $G_n$, and the training
error $T_n$.

Even for general singular cases, real log canonical thresholds have been
clarified in several cases. However, this paper gives the first case in which
the singular fluctuation is clarified. In general singular cases, it is
conjectured that the real log canonical threshold is not equal to the
singular fluctuation. Clarifying this conjecture is left for future study.

5.2 Practical point of view 

In applications, even if both birational invariants are unknown, the
generalization error can be estimated from the training error and the
functional variance [13] because

\[ E[G_n] = E[T_n] + \frac{\beta}{n} E[V_n] + o\Big(\frac{1}{n}\Big), \]

which is asymptotically equivalent to Bayes cross validation [14].
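The relation above suggests a simple estimator that uses posterior draws only. The sketch below is an illustration under stated assumptions, not the authors' implementation: `log_lik[k, i]` is assumed to hold log p(X_i | w_k) for draws w_k from the posterior, and the errors are replaced by the corresponding losses (minus log predictive densities), which differ from $G_n$ and $T_n$ by entropy terms of $q$ that have the same expectation on both sides. The function name is hypothetical.

```python
import numpy as np

def generalization_loss_estimate(log_lik, beta=1.0):
    """Return (training loss, estimated generalization loss) from posterior draws.

    log_lik[k, i] = log p(X_i | w_k), with w_k drawn from the posterior (assumed input).
    """
    n = log_lik.shape[1]
    # Bayes training loss: -(1/n) sum_i log E_w[p(X_i|w)], via a stable log-mean-exp.
    shift = log_lik.max(axis=0)
    log_pred = shift + np.log(np.mean(np.exp(log_lik - shift), axis=0))
    training_loss = -np.mean(log_pred)
    # Functional variance V_n: posterior variance of log p(X_i|w), summed over i.
    v_n = np.sum(np.var(log_lik, axis=0))
    return float(training_loss), float(training_loss + beta * v_n / n)

# Tiny hand-checkable input: two posterior draws, two training samples.
toy = np.log(np.array([[0.5, 0.2],
                       [0.5, 0.8]]))
train, estimate = generalization_loss_estimate(toy, beta=1.0)
print(train, estimate)
```

In the toy input both predictive densities equal 0.5, so the training loss is log 2, and only the second sample contributes to the functional variance.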

However, in Bayes estimation, how to approximate the posterior distribution
using a Markov chain Monte Carlo (MCMC) method is an important issue. There
are many parameters which determine the MCMC process, for example, the number
of burn-in steps, the number of updates needed for sufficient sampling, and
so on. If we know the concrete values of the birational invariants, then we
can evaluate how accurate the MCMC process is. Therefore, the quasi-regular
cases are appropriate for evaluating MCMC processes. Evaluating MCMC
processes using the quasi-regular cases is left for future study.

6 Conclusion 

In the present paper, a new concept, the quasi-regular case, was proposed for
the first time, and its theoretical foundation was constructed. A
quasi-regular case is not a regular case but a singular case, whereas it has
the same property as a regular case. In a quasi-regular case, it was proved
that the real log canonical threshold is equal to the singular fluctuation.
This is the first case in which a nontrivial value of the singular
fluctuation is clarified.